Statistical learning: unsupervised learning

MACS 30100
University of Chicago

March 8, 2017

Unsupervised learning

  • Supervised learning
  • Unsupervised learning
    • Exploration

Clustering methods

  • Clustering
  • \(K\)-means clustering
  • Hierarchical clustering

\(K\)-means clustering

\(K\)-means clustering

  • \(C_1, C_2, \dots, C_K\): indices of observations in each cluster
  • Minimize the within-cluster variation

    \[\min_{C_1, C_2, \dots, C_K} \left\{ \sum_{k = 1}^K W(C_k) \right\}\]

  • Squared Euclidean distance

    \[W(C_k) = \frac{1}{|C_k|} \sum_{i,i' \in C_k} \sum_{j = 1}^p (x_{ij} - x_{i'j})^2\]

  • Global vs. local optimum
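
The within-cluster variation can be computed directly from the formula above: average the squared Euclidean distances over all ordered pairs of observations in the cluster. A minimal Python sketch (the course materials use R; the example cluster here is hypothetical):

```python
def within_cluster_variation(points):
    """W(C_k) = (1/|C_k|) * sum over pairs (i, i') of squared Euclidean distances."""
    n = len(points)
    total = 0.0
    for i in range(n):
        for j in range(n):  # ordered pairs, matching the double sum in the formula
            total += sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
    return total / n

cluster = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
print(within_cluster_variation(cluster))  # 16.0
```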

\(K\)-means clustering

  1. Randomly assign each observation to one of \(K\) clusters
  2. For each of the \(K\) clusters:
    1. Compute the cluster centroid
    2. Reassign observations
  3. Rinse and repeat until stable
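
The three steps above can be sketched in plain Python (a language-neutral illustration; the six example points are hypothetical, and R's `kmeans()` does this properly):

```python
import random

def kmeans(points, k, iters=50, seed=0):
    rng = random.Random(seed)
    # Step 1: randomly assign each observation to one of K clusters.
    labels = [rng.randrange(k) for _ in points]
    for _ in range(iters):
        # Step 2a: compute each cluster's centroid (feature-wise mean).
        centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                members = [rng.choice(points)]  # reseed an empty cluster
            centroids.append([sum(col) / len(members) for col in zip(*members)])
        # Step 2b: reassign each observation to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Step 3: stop once the assignments are stable.
        if new_labels == labels:
            break
        labels = new_labels
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = kmeans(pts, k=2)
```

Because step 1 is random, each run can converge to a different local optimum of the objective; in practice the algorithm is restarted several times and the best solution kept.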

Hierarchical clustering

  • Fixed \(K\)
  • Hierarchical clustering
  • Dendrograms

Interpreting dendrograms

Estimating hierarchical clusters

  1. Treat each of the \(n\) observations as its own cluster and calculate all pairwise dissimilarities
  2. For \(i=n, n-1, \dots, 2\):
    1. Identify least dissimilar pair of clusters and fuse them
    2. Compute the new pairwise inter-cluster dissimilarities among the \(i-1\) clusters
  3. Rinse and repeat until only a single cluster remains
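
The agglomerative procedure above, sketched in Python with complete linkage and squared Euclidean dissimilarity (the three one-dimensional points are a made-up example; R's `hclust()` is the real tool):

```python
def hierarchical(points, dist):
    """Agglomerative clustering; returns the merge history (the dendrogram)."""
    # Step 1: every observation starts as its own cluster.
    clusters = [[i] for i in range(len(points))]
    merges = []
    # Steps 2-3: repeatedly fuse the least dissimilar pair of clusters.
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: largest pairwise dissimilarity.
                d = max(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

def sqdist(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))

pts = [(0.0,), (1.0,), (5.0,)]
merges = hierarchical(pts, sqdist)
for left, right, height in merges:
    print(left, right, height)  # fuse height = dendrogram height of the merge
```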

Linkage

  • Complete
  • Single
  • Average
  • Centroid
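
The four linkage rules differ only in how they score the dissimilarity between two clusters. A sketch using squared Euclidean dissimilarity (the clusters `A` and `B` are hypothetical examples):

```python
def sqdist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def complete(A, B):   # largest pairwise dissimilarity
    return max(sqdist(a, b) for a in A for b in B)

def single(A, B):     # smallest pairwise dissimilarity
    return min(sqdist(a, b) for a in A for b in B)

def average(A, B):    # mean of all pairwise dissimilarities
    return sum(sqdist(a, b) for a in A for b in B) / (len(A) * len(B))

def centroid(A, B):   # dissimilarity between the two cluster centroids
    ca = [sum(col) / len(A) for col in zip(*A)]
    cb = [sum(col) / len(B) for col in zip(*B)]
    return sqdist(ca, cb)

A = [(0.0, 0.0), (0.0, 2.0)]
B = [(3.0, 0.0), (3.0, 2.0)]
print(complete(A, B), single(A, B), average(A, B), centroid(A, B))
# 13.0 9.0 11.0 9.0
```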

Dimension reduction

  • Visualize data with lots of variables
  • Use the variables in supervised learning when there are too many to include directly
  • Identify a small number of variables that collectively explain most of the variability in the original data

DW-NOMINATE

  • Studying legislative voting
  • Roll-call votes
  • Multidimensional scaling techniques
  • DW-NOMINATE dimensions
    • Ideology (liberal-conservative)
    • Issue of the day

Voteview.org

Principal components analysis

  • Find a low-dimensional representation of the data that retains as much of the variation as possible
  • Dimensions are linear combinations of the variables

First principal component

\[Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + \dots + \phi_{p1}X_p\]

\[\sum_{j=1}^p \phi_{j1}^2 = 1\]

Estimation procedure

\[\max_{\phi_{11}, \dots, \phi_{p1}} \left \{ \frac{1}{n} \sum_{i=1}^n z_{i1}^2 \right \}\]

Such that \(\sum_{j=1}^p \phi_{j1}^2 = 1\)

Result of estimation

  • Loadings: \(\phi_{11}, \dots, \phi_{p1}\)
  • Scores: \(z_{11}, z_{21}, \dots, z_{n1}\)
  • Estimating second PC, third PC, etc.
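
One way to obtain the first-PC loadings and scores is power iteration on \(X'X\) after centering the variables. A bare-bones sketch (in practice R's `prcomp()` computes all components at once via the singular value decomposition; the data here are hypothetical):

```python
import math

def first_pc(X, iters=200):
    n, p = len(X), len(X[0])
    # Center each variable (PCA assumes mean-zero variables).
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    # Power iteration on X'X converges to the loading vector phi that
    # maximizes (1/n) * sum of squared scores, subject to ||phi|| = 1.
    phi = [1.0] * p
    for _ in range(iters):
        z = [sum(r[j] * phi[j] for j in range(p)) for r in Xc]       # X phi
        v = [sum(Xc[i][j] * z[i] for i in range(n)) for j in range(p)]  # X'X phi
        norm = math.sqrt(sum(x * x for x in v))
        phi = [x / norm for x in v]                                  # renormalize
    scores = [sum(r[j] * phi[j] for j in range(p)) for r in Xc]
    return phi, scores

X = [(1.0, 1.0), (2.0, 2.1), (3.0, 2.9), (4.0, 4.0)]
phi, scores = first_pc(X)
print(phi)  # close to the diagonal direction (1/sqrt(2), 1/sqrt(2))
```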

Biplot of USArrests

Bag-of-words

 a abandoned abc ability able about above abroad absorbed absorbing abstract
43         0   0       0    0    10     0      0        0         0        1
  1. Sparsity
  2. Stop words
  3. Correlation between words
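
A minimal sketch of building a document-term matrix from raw text, using two of the food-and-animals sentences that appear later in the deck (the stop-word list here is a small hypothetical one, not a standard lexicon). Even in this tiny example most entries are zero, which is the sparsity problem above:

```python
from collections import Counter

STOP_WORDS = {"a", "the", "and", "to", "i", "at", "my", "on", "for", "this"}

def doc_term_matrix(docs):
    """Build a dense document-term matrix of raw term frequencies."""
    tokenized = [[w.strip(".,").lower() for w in d.split()] for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc} - STOP_WORDS)
    return vocab, [[Counter(doc)[w] for w in vocab] for doc in tokenized]

docs = ["I like to eat broccoli and bananas.",
        "Chinchillas and kittens are cute."]
vocab, dtm = doc_term_matrix(docs)
print(vocab)
print(dtm)
```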

Latent semantic analysis

  • Identify words closely related to one another
  • Makes searching by keyword easier
  • Uses PCA or similar techniques

NYTimes

##  [1] "penchant"  "brought"   "structure" "willing"   "yielding" 
##  [6] "bare"      "school"    "halls"     "challenge" "step"     
## [11] "largest"   "lovers"    "intense"   "borders"   "mall"     
## [16] "classic"   "conducted" "mirrors"   "hole"      "location" 
## [21] "desperate" "published" "head"      "paints"    "another"  
## [26] "starts"    "familiar"  "window"    "thats"     "broker"


Topic modeling

  • Themes
  • Probabilistic topic models
  • Latent Dirichlet allocation

Food and animals

  1. I ate a banana and spinach smoothie for breakfast.
  2. I like to eat broccoli and bananas.
  3. Chinchillas and kittens are cute.
  4. My sister adopted a kitten yesterday.
  5. Look at this cute hamster munching on a piece of broccoli.

LDA document structure

  • Decide on the number of words \(N\) the document will have
  • Generate each word in the document:
    • Pick a topic
    • Generate the word
    • LDA works backward from this generative process

Food and animals

  • Decide that \(D\) will be 1/2 about food and 1/2 about cute animals.
  • Pick 5 to be the number of words in \(D\).
  • Pick the first word to come from the food topic
  • Pick the second word to come from the cute animals topic
  • Pick the third word to come from the cute animals topic
  • Pick the fourth word to come from the food topic
  • Pick the fifth word to come from the food topic
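
This generative story can be simulated directly: draw a topic from the document's topic mixture, then a word from that topic's word distribution. The topic-word probabilities below are made up for illustration:

```python
import random

rng = random.Random(1)

# Hypothetical topic-word distributions for the two topics on the slide.
topics = {
    "food":    {"broccoli": 0.4, "bananas": 0.3, "breakfast": 0.3},
    "animals": {"kitten": 0.4, "chinchilla": 0.3, "cute": 0.3},
}
mixture = {"food": 0.5, "animals": 0.5}  # D is half food, half cute animals

def generate(n_words):
    doc = []
    for _ in range(n_words):
        # Pick a topic from the document's topic mixture...
        t = rng.choices(list(mixture), weights=list(mixture.values()))[0]
        # ...then generate a word from that topic's word distribution.
        words = topics[t]
        doc.append(rng.choices(list(words), weights=list(words.values()))[0])
    return doc

doc = generate(5)
print(doc)
```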

How does LDA learn?

  • Randomly assign each word in the document to one of \(K\) topics
  • For each document \(d\):
    • Go through each word \(w\) in \(d\)
      • And for each topic \(t\), compute two things:
        1. \(p(t | d)\)
        2. \(p(w | t)\)
      • Reassign \(w\) a new topic \(t\) with probability \(p(t|d) \times p(w|t)\)
  • Rinse and repeat

  • Estimate from LDA
    1. The topic mixtures of each document
    2. The words associated with each topic
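
The learning loop above corresponds to collapsed Gibbs sampling. A bare-bones sketch on three toy documents (the smoothing hyperparameters `alpha` and `beta` are standard additions not shown on the slide; real analyses use a package such as `topicmodels` in R):

```python
import random

def lda_gibbs(docs, K, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    # Randomly assign every word token to one of K topics.
    z = [[rng.randrange(K) for _ in d] for d in docs]
    # Count tables: topics per document, words per topic, topic totals.
    ndk = [[0] * K for _ in docs]
    nkw = [[0] * V for _ in range(K)]
    nk = [0] * K
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            ndk[d][t] += 1; nkw[t][vocab.index(w)] += 1; nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t, wi = z[d][i], vocab.index(w)
                # Remove this token's current assignment...
                ndk[d][t] -= 1; nkw[t][wi] -= 1; nk[t] -= 1
                # ...and resample with probability p(t|d) * p(w|t).
                weights = [(ndk[d][k] + alpha) * (nkw[k][wi] + beta) /
                           (nk[k] + V * beta) for k in range(K)]
                t = rng.choices(range(K), weights=weights)[0]
                z[d][i] = t
                ndk[d][t] += 1; nkw[t][wi] += 1; nk[t] += 1
    # ndk gives the topic mixture of each document, nkw the words per topic.
    return vocab, ndk, nkw

docs = [["broccoli", "bananas", "breakfast"],
        ["kitten", "chinchilla", "cute"],
        ["broccoli", "bananas", "kitten"]]
vocab, ndk, nkw = lda_gibbs(docs, K=2)
```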

Associated Press articles

## <<DocumentTermMatrix (documents: 2246, terms: 10134)>>
## Non-/sparse entries: 259208/22501756
## Sparsity           : 99%
## Maximal term length: 18
## Weighting          : term frequency (tf)

Perplexity

  • A statistical measure of how well a probability model predicts a sample
  • Compare the theoretical word distributions implied by the topics to the actual distribution of words in the documents
  • Perplexity for LDA model with 12 topics
    • 2264.508
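
Perplexity is the exponentiated average negative log-likelihood per token, so a model that assigned every token probability \(1/2264.508\) would score exactly the perplexity above. A sketch of the computation:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# A uniform model over ~2264 equally likely words reproduces the slide's value.
print(perplexity([1 / 2264.508] * 10))  # 2264.508 (up to rounding)
```

Lower perplexity means the model is less "surprised" by the held-out words, which is why it is used to compare models with different numbers of topics.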

Topics from \(k=100\)